Run this Notebook on GCP with Dataproc
Spark can automatically uncompress from a variety of formats
spark.version
'3.1.3'
spark./gateway/default/node/conf?host&port.set("spark.sql.repl.eagerEval.enabled",True)
import os
import subprocess
import shutil
import pandas as pd
# import sh
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import *
pd.set_option("max_colwidth", 100)
from google.cloud import storage
# !pip install gcsfs --upgrade
# List all files in given COS directory
def list_blobs(bucket_name, folder_name):
gcs_client = storage.Client()
bucket = gcs_client.bucket(bucket_name)
blobs = list(bucket.list_blobs(prefix=folder_name))
for blob in blobs:
print(blob.name + '\t' + str(blob.size))
# List all files in given COS directory
def list_blobs_pd(bucket_name, folder_name):
gcs_client = storage.Client()
bucket = gcs_client.bucket(bucket_name)
blobs = list(bucket.list_blobs(prefix=folder_name))
blob_name = []
blob_size = []
for blob in blobs:
blob_name.append(blob.name)
blob_size.append(blob.size)
blobs_df = pd.DataFrame(list(zip(blob_name, blob_size)), columns=['Name','Size'])
blobs_df.style.format({"Size": "{:,.0f}"})
return blobs_df
# Delete folder from COS bucket
def delete_folder(bucket_name, folder_name):
gcs_client = storage.Client()
bucket = gcs_client.bucket(bucket_name)
blobs = list(bucket.list_blobs(prefix=folder_name))
for blob in blobs:
blob.delete()
# It will display the total disk usage of the directory "amazon_reviews" located at the specified path in Google Cloud Storage
!hadoop fs -du -s -h 'gs://msca-bdp-amazon/amazon_reviews/'
Apr 28, 2024 6:52:39 AM com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics updateMinMaxStats INFO: Detected potential high latency for operation op_get_file_status. latencyMs=227; previousMaxLatencyMs=0; operationCount=1; context=gs://msca-bdp-amazon/amazon_reviews Apr 28, 2024 6:52:39 AM com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics updateMinMaxStats INFO: Detected potential high latency for operation op_glob_status. latencyMs=307; previousMaxLatencyMs=0; operationCount=1; context=path=gs://msca-bdp-amazon/amazon_reviews; pattern=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$$Lambda$8/265321659@4baf352a 73.8 G 73.8 G gs://msca-bdp-amazon/amazon_reviews
# Tt will list the contents of the directory "amazon_reviews" located at the specified path in Google Cloud Storage.
!hadoop fs -ls 'gs://msca-bdp-amazon/amazon_reviews/'
Apr 28, 2024 6:52:42 AM com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics updateMinMaxStats INFO: Detected potential high latency for operation op_get_file_status. latencyMs=230; previousMaxLatencyMs=0; operationCount=1; context=gs://msca-bdp-amazon/amazon_reviews Apr 28, 2024 6:52:42 AM com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics updateMinMaxStats INFO: Detected potential high latency for operation op_glob_status. latencyMs=314; previousMaxLatencyMs=0; operationCount=1; context=path=gs://msca-bdp-amazon/amazon_reviews; pattern=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$$Lambda$8/1050065615@15eebbff Found 46 items -rwx------ 3 root root 1971061630 2020-10-28 20:26 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Apparel_v1_00.tsv -rwx------ 3 root root 1350294084 2020-10-28 20:26 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Automotive_v1_00.tsv -rwx------ 3 root root 872274720 2020-10-28 20:26 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Baby_v1_00.tsv -rwx------ 3 root root 2152186111 2020-10-28 20:27 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Beauty_v1_00.tsv -rwx------ 3 root root 6699171766 2020-10-28 20:29 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Books_v1_00.tsv -rwx------ 3 root root 6703921864 2020-10-28 20:31 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Books_v1_01.tsv -rwx------ 3 root root 3238702530 2020-10-28 20:32 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Books_v1_02.tsv -rwx------ 3 root root 1100169988 2020-10-28 20:32 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Camera_v1_00.tsv -rwx------ 3 root root 6696964946 2020-10-28 20:35 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv -rwx------ 3 root root 3224038446 2020-10-28 20:36 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv -rwx------ 3 root root 628880453 2020-10-28 20:36 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Digital_Music_Purchase_v1_00.tsv -rwx------ 3 root root 53855391 2020-10-28 20:36 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Digital_Software_v1_00.tsv -rwx------ 3 root root 1288048833 2020-10-28 20:36 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Digital_Video_Download_v1_00.tsv -rwx------ 3 root root 73154460 2020-10-28 20:36 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Digital_Video_Games_v1_00.tsv -rwx------ 3 root root 1725988504 2020-10-28 20:37 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Electronics_v1_00.tsv -rwx------ 3 root root 366979553 2020-10-28 20:37 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Furniture_v1_00.tsv -rwx------ 3 root root 39977611 2020-10-28 20:37 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Gift_Card_v1_00.tsv -rwx------ 3 root root 956224754 2020-10-28 20:37 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Grocery_v1_00.tsv -rwx------ 3 root root 2422088742 2020-10-28 20:38 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Health_Personal_Care_v1_00.tsv -rwx------ 3 root root 500303241 2020-10-28 20:38 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Home_Entertainment_v1_00.tsv -rwx------ 3 root root 1202422006 2020-10-28 20:39 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Home_Improvement_v1_00.tsv -rwx------ 3 root root 2496148207 2020-10-28 20:39 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Home_v1_00.tsv -rwx------ 3 root root 617809191 2020-10-28 20:40 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Jewelry_v1_00.tsv -rwx------ 3 root root 2218102275 2020-10-28 20:40 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Kitchen_v1_00.tsv -rwx------ 3 root root 1148784380 2020-10-28 20:41 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Lawn_and_Garden_v1_00.tsv -rwx------ 3 root root 150519467 2020-10-28 20:41 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Luggage_v1_00.tsv -rwx------ 3 root root 62977388 2020-10-28 20:41 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Major_Appliances_v1_00.tsv -rwx------ 3 root root 1384889812 2020-10-28 20:41 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Mobile_Apps_v1_00.tsv -rwx------ 3 root root 58221236 2020-10-28 20:41 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Mobile_Electronics_v1_00.tsv -rwx------ 3 root root 3671037159 2020-10-28 20:42 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Music_v1_00.tsv -rwx------ 3 root root 475220114 2020-10-28 20:43 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Musical_Instruments_v1_00.tsv -rwx------ 3 root root 1243069057 2020-10-28 20:43 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Office_Products_v1_00.tsv -rwx------ 3 root root 1061801717 2020-10-28 20:43 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Outdoors_v1_00.tsv -rwx------ 3 root root 3648888910 2020-10-28 20:45 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_PC_v1_00.tsv -rwx------ 3 root root 44891575 2020-10-28 20:45 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv -rwx------ 3 root root 1227951681 2020-10-28 20:45 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Pet_Products_v1_00.tsv -rwx------ 3 root root 1570176560 2020-10-28 20:45 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Shoes_v1_00.tsv -rwx------ 3 root root 249565371 2020-10-28 20:46 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Software_v1_00.tsv -rwx------ 3 root root 2007024927 2020-10-28 20:46 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Sports_v1_00.tsv -rwx------ 3 root root 788203941 2020-10-28 20:46 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Tools_v1_00.tsv -rwx------ 3 root root 1962834272 2020-10-28 20:47 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Toys_v1_00.tsv -rwx------ 3 root root 3708889477 2020-10-28 20:48 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Video_DVD_v1_00.tsv -rwx------ 3 root root 1204902329 2020-10-28 20:49 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Video_Games_v1_00.tsv -rwx------ 3 root root 337970606 2020-10-28 20:49 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Video_v1_00.tsv -rwx------ 3 root root 412542975 2020-10-28 20:49 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Watches_v1_00.tsv -rwx------ 3 root root 4198056580 2020-10-28 20:50 gs://msca-bdp-amazon/amazon_reviews/amazon_reviews_us_Wireless_v1_00.tsv
path_read = 'gs://msca-bdp-amazon/amazon_reviews/'
%%time
amazon_reviews = spark.read.csv(path_read, header='true', inferSchema='true', sep='\t', quote='')
[Stage 2:======================================================>(596 + 2) / 598]
CPU times: user 382 ms, sys: 74.3 ms, total: 456 ms Wall time: 3min 32s
%%time
amazon_reviews.count()
[Stage 3:======================================================>(597 + 1) / 598]
CPU times: user 147 ms, sys: 22.1 ms, total: 169 ms Wall time: 47.2 s
150962278
type(amazon_reviews)
pyspark.sql.dataframe.DataFrame
amazon_reviews.first()
Row(marketplace='US', customer_id=22873041, review_id='R3ARRMDEGED8RD', product_id='B00KJWQIIC', product_parent=335625766, product_title='Plemo 14-Inch Laptop Sleeve Case Waterproof Fabric Bag for MacBook Air / Laptops / Notebook, Gray', product_category='PC', star_rating=5, helpful_votes=0, total_votes=0, vine='N', verified_purchase='Y', review_headline='Pleasantly surprised', review_body='I was very surprised at the high quality of the stitching, the sturdiness of the handles and the padding for my laptop. The price is amazingly low and the look is very good. I am quite happy with this purchase. It fit my MacBook Pro perfectly, with a little bit of room to spare.', review_date='2015-08-31')
amazon_reviews.show(3)
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+ |marketplace|customer_id| review_id|product_id|product_parent| product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase| review_headline| review_body|review_date| +-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+ | US| 22873041|R3ARRMDEGED8RD|B00KJWQIIC| 335625766|Plemo 14-Inch Lap...| PC| 5| 0| 0| N| Y|Pleasantly surprised|I was very surpri...| 2015-08-31| | US| 30088427| RQ28TSA020Y6J|B013ALA9LA| 671157305|TP-Link OnHub AC1...| PC| 5| 24| 31| N| N|OnHub is a pretty...|I am a Google emp...| 2015-08-31| | US| 20329786| RUXJRZCT6953M|B00PML2GQ8| 982036237|AmazonBasics USB ...| PC| 1| 2| 2| N| N|None of them work...|Bought cables in ...| 2015-08-31| +-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+ only showing top 3 rows
This will output the results of DataFrames in each step without the need to use df.show() and also improves the formatting of the output
spark./gateway/default/node/conf?host&port.set("spark.sql.repl.eagerEval.enabled",True)
amazon_reviews.limit(5)
| marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| US | 39857879 | R3CDKELZC3G7KH | B00KKDAXV8 | 172874826 | ANiceSeller Wall ... | PC | 1 | 0 | 0 | N | Y | One Star | Doesn't work with... | 2015-05-28 |
| US | 1530077 | R1MQ3U9LGHYB1A | B00C7RJQPY | 354869813 | Cable Matters Min... | PC | 5 | 0 | 0 | N | Y | Five Stars | Works perfectly, ... | 2015-05-28 |
| US | 26185761 | R4IFN0GK0S90C | B00OZFFX0I | 36527192 | ProntoTec 7 inch ... | PC | 1 | 0 | 0 | N | Y | Disappointed again | Once again I 'dre... | 2015-05-28 |
| US | 23132125 | R1PV8JMV2IC0X1 | B0038LIK82 | 614307240 | ineo I-NA316N1-R,... | PC | 4 | 0 | 0 | N | Y | I recently dug th... | I recently dug th... | 2015-05-28 |
| US | 38131659 | R3IUSIL2DUNF7R | B00DQFGJR4 | 934621546 | AmazonBasics USB Hub | PC | 5 | 0 | 0 | N | Y | Fully functional ... | Great USB hub. | 2015-05-28 |
amazon_reviews.printSchema()
root |-- marketplace: string (nullable = true) |-- customer_id: integer (nullable = true) |-- review_id: string (nullable = true) |-- product_id: string (nullable = true) |-- product_parent: integer (nullable = true) |-- product_title: string (nullable = true) |-- product_category: string (nullable = true) |-- star_rating: integer (nullable = true) |-- helpful_votes: integer (nullable = true) |-- total_votes: integer (nullable = true) |-- vine: string (nullable = true) |-- verified_purchase: string (nullable = true) |-- review_headline: string (nullable = true) |-- review_body: string (nullable = true) |-- review_date: string (nullable = true)
# amazon_reviews.describe().show()
# Group by product category and count the number of reviews for each category
category_counts = amazon_reviews.groupBy('product_category').count()
# Find the category with the highest number of reviews
most_reviews_category = category_counts.orderBy(F.desc('count')).first()['product_category']
# Print the result
print("The product category with the most reviews is:", most_reviews_category)
The product category with the most reviews is: Books
# Group by product category and calculate the average star rating for each category
category_avg_star_rating = amazon_reviews.groupBy('product_category').agg(F.avg('star_rating').alias('avg_star_rating'))
category_avg_star_rating.orderBy(F.desc('avg_star_rating')).first()
Row(product_category='Gift Card', avg_star_rating=4.731363105858364)
# Find the category with the highest average star rating
highest_rating_category = category_avg_star_rating.orderBy(F.desc('avg_star_rating')).first()['product_category']
category_avg_star_rating.orderBy('avg_star_rating').first()
Row(product_category='Digital_Software', avg_star_rating=3.5393303553935973)
# Find the category with the lowest average star rating
lowest_rating_category = category_avg_star_rating.orderBy('avg_star_rating').first()['product_category']
# Print the results
print("The product category with the highest average star rating is:", highest_rating_category)
print("The product category with the lowest average star rating is:", lowest_rating_category)
The product category with the highest average star rating is: Gift Card The product category with the lowest average star rating is: Digital_Software
# Extract year and month from the review_date column
amazon_reviews = amazon_reviews.withColumn('year', F.year(F.to_date('review_date')))
amazon_reviews = amazon_reviews.withColumn('month', F.month(F.to_date('review_date')))
amazon_reviews.printSchema()
root |-- marketplace: string (nullable = true) |-- customer_id: integer (nullable = true) |-- review_id: string (nullable = true) |-- product_id: string (nullable = true) |-- product_parent: integer (nullable = true) |-- product_title: string (nullable = true) |-- product_category: string (nullable = true) |-- star_rating: integer (nullable = true) |-- helpful_votes: integer (nullable = true) |-- total_votes: integer (nullable = true) |-- vine: string (nullable = true) |-- verified_purchase: string (nullable = true) |-- review_headline: string (nullable = true) |-- review_body: string (nullable = true) |-- review_date: string (nullable = true) |-- year: integer (nullable = true) |-- month: integer (nullable = true)
amazon_reviews.limit(5)
| marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | year | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| US | 5579082 | R2XM6N8JQ14XEA | B00ECKHCN2 | 690725582 | FastTouch(TM) 7 I... | PC | 5 | 0 | 0 | N | Y | we love it | I am glad i purch... | 2014-02-27 | 2014 | 2 |
| US | 3204579 | RGBQQC1358UL5 | B0015M1OAW | 556506525 | DVD-multi Drive C... | PC | 2 | 1 | 1 | N | Y | review | The item works we... | 2014-02-27 | 2014 | 2 |
| US | 19807287 | R2IAME3FVQ71ES | B00BR5FZPC | 268206351 | SquareTrade Table... | PC | 5 | 0 | 0 | N | Y | they are great!!!... | i love them!!!!!!... | 2014-02-27 | 2014 | 2 |
| US | 41213197 | R38SOZM2NAZNM3 | B004JXVOD6 | 331632533 | Kindle Leather Co... | PC | 4 | 0 | 0 | N | Y | Great price | This is a good pr... | 2014-02-27 | 2014 | 2 |
| US | 43589513 | R2D710Q1C4KFR2 | B0058DIQPK | 572088316 | Kindle Touch, Wi-... | PC | 3 | 0 | 0 | N | Y | poor customer ser... | I Had to do all t... | 2014-02-27 | 2014 | 2 |
# Group by product category, year, and month, and count the number of reviews
reviews_by_category_year_month = amazon_reviews.groupBy('product_category', 'year', 'month').count()
# reviews_by_category_year_month.limit(5)
# Order by year and month within each product category
reviews_by_category_year_month_ordered = reviews_by_category_year_month.orderBy('product_category', 'year', 'month')
# Display the ordered result
reviews_by_category_year_month_ordered.show()
[Stage 35:=====================================================>(596 + 2) / 598]
+----------------+----+-----+-----+ |product_category|year|month|count| +----------------+----+-----+-----+ | Apparel|2000| 9| 1| | Apparel|2000| 10| 4| | Apparel|2000| 11| 1| | Apparel|2001| 10| 4| | Apparel|2001| 12| 1| | Apparel|2002| 1| 8| | Apparel|2002| 2| 20| | Apparel|2002| 3| 2| | Apparel|2002| 4| 1| | Apparel|2002| 5| 1| | Apparel|2002| 6| 1| | Apparel|2002| 8| 10| | Apparel|2002| 9| 22| | Apparel|2002| 10| 67| | Apparel|2002| 11| 442| | Apparel|2002| 12| 333| | Apparel|2003| 1| 195| | Apparel|2003| 2| 133| | Apparel|2003| 3| 153| | Apparel|2003| 4| 145| +----------------+----+-----+-----+ only showing top 20 rows
# Draw line chart to visualize the result for Gift Card - each year is a line and x axis is month.
import pandas as pd
import matplotlib.pyplot as plt
# Convert the Spark DataFrame to a Pandas DataFrame
reviews_pd = reviews_by_category_year_month_ordered.toPandas()
# Filter data for the 'Gift Card' category
gift_card_reviews = reviews_pd[reviews_pd['product_category'] == 'Gift Card']
gift_card_reviews
| product_category | year | month | count | |
|---|---|---|---|---|
| 2129 | Gift Card | 2004 | 10 | 1 |
| 2130 | Gift Card | 2004 | 11 | 2 |
| 2131 | Gift Card | 2004 | 12 | 1 |
| 2132 | Gift Card | 2005 | 1 | 2 |
| 2133 | Gift Card | 2005 | 2 | 4 |
| ... | ... | ... | ... | ... |
| 2236 | Gift Card | 2015 | 4 | 4155 |
| 2237 | Gift Card | 2015 | 5 | 3933 |
| 2238 | Gift Card | 2015 | 6 | 4281 |
| 2239 | Gift Card | 2015 | 7 | 4346 |
| 2240 | Gift Card | 2015 | 8 | 4021 |
112 rows × 4 columns
# Plotting
plt.figure(figsize=(10, 6))
for year in gift_card_reviews['year'].unique():
data = gift_card_reviews[gift_card_reviews['year'] == year]
plt.plot(data['month'], data['count'], label=year)
plt.xlabel('Month')
plt.ylabel('Number of Reviews')
plt.title('Number of Reviews for Gift Card by Year and Month')
plt.legend()
plt.grid(True)
plt.show()
Apparel: December
Automotive: August
Baby: December
Beauty: December
Books: December
Camera: December
Digital_ebook_purchase: March, December
Digital_music_purchase: March, December
Digital_software: March
Digital_video_download: August, December
Digital_video_games: March, December
Electronics: December
Furniture: March, December
Gift card: December
Grocery: March December
Health & personal care: December
home: December
home entertainment: December
home improvement: December
Jewelry: December
Kitchen: December
Lawn and Garden: July
Luggage: December
Major: December
Mobile apps: March, December
Mobile electronics: December
Music: December
Musical instruments: December
Office products: December
Outdoors: august, December
PC: December
Personal care appliances: December
Pet products: December
shoes: December
software: December
sports: December
tools: December
toys: December
video: December
video DVD: December
video GAMES: December
watches: December
wireless: December
reviews_by_category_year_month_ordered.select('product_category').distinct().show()
+--------------------+ | product_category| +--------------------+ | PC| | Major Appliances| | Lawn and Garden| | Kitchen| | Home Entertainment| | Home Improvement| | Home| | Wireless| | Video| | Digital_Video_Games| |Digital_Video_Dow...| | Luggage| | Video DVD| | Sports| | Mobile_Electronics| |Digital_Ebook_Pur...| | Baby| | Pet Products| | Apparel| | Grocery| +--------------------+ only showing top 20 rows
reviews_by_category_year_month_ordered.select('product_category').distinct().count()
43
# Get all unique categories
unique_categories = reviews_pd['product_category'].unique()
# Plotting
for category in unique_categories:
plt.figure(figsize=(10, 6))
plt.title(f'Number of Reviews for {category} by Year and Month')
plt.xlabel('Month')
plt.ylabel('Number of Reviews')
plt.grid(True)
# Filter data for the current category
category_reviews = reviews_pd[reviews_pd['product_category'] == category]
# Iterate over each unique year
for year in category_reviews['year'].unique():
data = category_reviews[category_reviews['year'] == year]
plt.plot(data['month'], data['count'], label=year)
plt.legend()
plt.show()
amazon_reviews.limit(5)
| marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | year | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| US | 50788169 | R32GSWREY3K5DJ | B00CYQP3AK | 270050908 | Kindle Fire HDX 7... | PC | 5 | 1 | 1 | N | Y | Kindle Fire 7" HDX | Love the new kind... | 2013-11-24 | 2013 | 11 |
| US | 5697703 | R1QYOS3CL7YTT5 | B008D4X9UI | 238837269 | BUFFALO Technolog... | PC | 5 | 0 | 0 | N | Y | Very good | Thunderbolt and U... | 2013-11-24 | 2013 | 11 |
| US | 52281995 | R2TYX1LZF2VE43 | B007IL6OR2 | 529586088 | Linksys N750 Wi-F... | PC | 5 | 0 | 0 | N | Y | First rate! | Does exactly what... | 2013-11-24 | 2013 | 11 |
| US | 21531416 | RXPZW76NPYE4F | B00FC7DOCW | 580267817 | CS500P | PC | 5 | 1 | 1 | N | N | My thoughts | The creative Intu... | 2013-11-24 | 2013 | 11 |
| US | 28312853 | R29CDXFUHCPI7Q | B004Y3V2Q0 | 462236007 | SwitchEasy Canvas... | PC | 4 | 0 | 0 | N | Y | Elegant protectio... | It looks modestly... | 2013-11-24 | 2013 | 11 |
type(amazon_reviews)
pyspark.sql.dataframe.DataFrame
amazon_reviews.printSchema()
root |-- marketplace: string (nullable = true) |-- customer_id: integer (nullable = true) |-- review_id: string (nullable = true) |-- product_id: string (nullable = true) |-- product_parent: integer (nullable = true) |-- product_title: string (nullable = true) |-- product_category: string (nullable = true) |-- star_rating: integer (nullable = true) |-- helpful_votes: integer (nullable = true) |-- total_votes: integer (nullable = true) |-- vine: string (nullable = true) |-- verified_purchase: string (nullable = true) |-- review_headline: string (nullable = true) |-- review_body: string (nullable = true) |-- review_date: string (nullable = true) |-- year: integer (nullable = true) |-- month: integer (nullable = true)
from pyspark.sql.functions import desc
# For most helpful votes
most_helpful_reviews = amazon_reviews.orderBy(desc("helpful_votes")).limit(10)
# For most total votes
most_voted_reviews = amazon_reviews.orderBy(desc("total_votes")).limit(10)
# Show the results
most_helpful_reviews.show()
most_voted_reviews.show()
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+-----+ |marketplace|customer_id| review_id|product_id|product_parent| product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase| review_headline| review_body|review_date|year|month| +-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+-----+ | US| 52994074|R3R24QH3CDS83N|B000FI73MA| 796637267|Kindle: Amazon's ...| PC| 5| 47524| 48362| N| Y|Why and how the K...|This is less a \\...| 2007-11-25|2007| 11| | US| 50268169|R19XO9PS38WRWO|B004F9QBE6| 596552794|BIC Cristal For H...| Office Products| 5| 41393| 41866| N| N| FINALLY!|Someone has answe...| 2012-08-24|2012| 8| | US| 51121444|R2XKMDXZHQ26YX|B000OE2OLU| 771086291|The Mountain Kids...| Apparel| 5| 41278| 41889| N| N|Dual Function Design|This item has wol...| 2008-11-11|2008| 11| | US| 53025525|R2YVZNKUMWGYJ4|B003FSUDM4| 57096051|Kindle Keyboard 3...| PC| 5| 31924| 32373| N| Y|Kindle vs. Nook (...|UPDATE NOVEMBER 2...| 2010-08-28|2010| 8| | US| 12267963|R1PL1U31XQG8KG|B0083PWAPW| 270935683|Kindle Fire HD 7"...| PC| 4| 31417| 32166| N| Y|You Get What You ...|I've been an iPad...| 2012-09-17|2012| 9| | US| 51835976| RBUNL9QWDYTD7|B0051VVOB2| 341475382|Kindle Fire (Prev...| PC| 4| 28611| 29433| N| Y|A great device WH...|UPDATE November 2...| 2011-11-14|2011| 11| | US| 16378095|R2JF7E91JJVHAT|0345803485| 600633062|Fifty Shades of G...| Books| 2| 27550| 28727| N| Y|Did a teenager wr...|I really don't li...| 2012-04-15|2012| 4| | US| 9286343|R3PG4OX6C5KVN4|B000IZGIA8| 72888427|Wheelmate Laptop ...| Automotive| 5| 26132| 26382| N| N|Perfect for an St...|My husband Brad a...| 2013-04-17|2013| 4| | US| 50629044| RV0R3AODMRNJZ|B00154JDAI| 419585077|Kindle Wireless R...| PC| 1| 24714| 26143| N| Y|BEWARE of the SIG...|I was DELIGHTED t...| 2009-03-14|2009| 3| | US| 47364506| RXXPVOUH9NLL3|B00032G1S0| 753469671|Tuscan Dairy Whol...| Grocery| 5| 23755| 24170| N| N|Make this your on...|Once upon a mid-d...| 2008-07-08|2008| 7| +-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+-----+
[Stage 76:=====================================================>(597 + 1) / 598]
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+-----+ |marketplace|customer_id| review_id|product_id|product_parent| product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase| review_headline| review_body|review_date|year|month| +-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+-----+ | US| 52994074|R3R24QH3CDS83N|B000FI73MA| 796637267|Kindle: Amazon's ...| PC| 5| 47524| 48362| N| Y|Why and how the K...|This is less a \\...| 2007-11-25|2007| 11| | US| 51121444|R2XKMDXZHQ26YX|B000OE2OLU| 771086291|The Mountain Kids...| Apparel| 5| 41278| 41889| N| N|Dual Function Design|This item has wol...| 2008-11-11|2008| 11| | US| 50268169|R19XO9PS38WRWO|B004F9QBE6| 596552794|BIC Cristal For H...| Office Products| 5| 41393| 41866| N| N| FINALLY!|Someone has answe...| 2012-08-24|2012| 8| | US| 53025525|R2YVZNKUMWGYJ4|B003FSUDM4| 57096051|Kindle Keyboard 3...| PC| 5| 31924| 32373| N| Y|Kindle vs. Nook (...|UPDATE NOVEMBER 2...| 2010-08-28|2010| 8| | US| 12267963|R1PL1U31XQG8KG|B0083PWAPW| 270935683|Kindle Fire HD 7"...| PC| 4| 31417| 32166| N| Y|You Get What You ...|I've been an iPad...| 2012-09-17|2012| 9| | US| 51835976| RBUNL9QWDYTD7|B0051VVOB2| 341475382|Kindle Fire (Prev...| PC| 4| 28611| 29433| N| Y|A great device WH...|UPDATE November 2...| 2011-11-14|2011| 11| | US| 16378095|R2JF7E91JJVHAT|0345803485| 600633062|Fifty Shades of G...| Books| 2| 27550| 28727| N| Y|Did a teenager wr...|I really don't li...| 2012-04-15|2012| 4| | US| 9286343|R3PG4OX6C5KVN4|B000IZGIA8| 72888427|Wheelmate Laptop ...| Automotive| 5| 26132| 26382| N| N|Perfect for an St...|My husband Brad a...| 2013-04-17|2013| 4| | US| 50629044| RV0R3AODMRNJZ|B00154JDAI| 419585077|Kindle Wireless R...| PC| 1| 24714| 26143| N| Y|BEWARE of the SIG...|I was DELIGHTED t...| 2009-03-14|2009| 3| | US| 47364506| RXXPVOUH9NLL3|B00032G1S0| 753469671|Tuscan Dairy Whol...| Grocery| 5| 23755| 24170| N| N|Make this your on...|Once upon a mid-d...| 2008-07-08|2008| 7| +-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+-----+
from pyspark.sql.functions import split, size, col
from pyspark.sql.types import IntegerType
from pyspark.sql import functions as F
# Split review_headline into words and count the number of words
amazon_reviews = amazon_reviews.withColumn("headline_length", size(split(col("review_headline"), " ")).cast(IntegerType()))
# Calculate correlation between headline length and helpful votes
correlation = amazon_reviews.corr("headline_length", "helpful_votes")
# Show correlation
print("Correlation between headline length and helpful votes:", correlation)
[Stage 78:=====================================================>(594 + 4) / 598]
Correlation between headline length and helpful votes: 0.030980526018484132
from pyspark.sql.functions import split, size, col
from pyspark.sql.types import IntegerType
# Split review_body into words and count the number of words
amazon_reviews = amazon_reviews.withColumn("body_length", size(split(col("review_body"), " ")).cast(IntegerType()))
# Calculate correlation between body length and helpful votes
correlation = amazon_reviews.corr("body_length", "helpful_votes")
# Show correlation
print("Correlation between body length and helpful votes:", correlation)
[Stage 80:=====================================================>(597 + 1) / 598]
Correlation between body length and helpful votes: 0.13785942999384293
# Group DataFrame by customer_id and count the number of reviews published by each customer
customer_review_count = amazon_reviews.groupBy("customer_id").count()
# Rename the count column to something more descriptive
customer_review_count = customer_review_count.withColumnRenamed("count", "review_count")
# Join the customer_review_count DataFrame with the original DataFrame
amazon_reviews_with_count = amazon_reviews.join(customer_review_count, "customer_id", "inner")
# Calculate correlation between review count and helpful votes
correlation = amazon_reviews_with_count.corr("review_count", "helpful_votes")
# Show correlation
print("Correlation between review count and helpful votes:", correlation)
[Stage 86:=======================================================>(66 + 1) / 67]
Correlation between review count and helpful votes: 0.007543638017878737
In Spark, computational effectiveness is achieved through distributed computing across a cluster of machines. Spark efficiently distributes data processing tasks across the cluster and optimizes the execution of these tasks through various mechanisms. Here's how Spark achieves computational effectiveness:
Distributed Data Processing: Spark distributes data across the cluster in partitions, allowing parallel processing of data. Each partition of data is processed independently on different nodes in the cluster, enabling high throughput and scalability.
In-Memory Computation: Spark utilizes in-memory computation to cache intermediate data in memory, reducing the need to read from disk repeatedly. By keeping data in memory, Spark minimizes the latency associated with accessing data from disk, resulting in faster processing times.
Lazy Evaluation: Spark employs lazy evaluation, meaning that transformations on the data are not executed immediately. Instead, Spark builds up a directed acyclic graph (DAG) representing the sequence of transformations to be applied to the data. This allows Spark to optimize the execution plan by combining and reordering transformations before executing them, improving performance.
Task Pipelining: Spark pipelines tasks together to minimize data shuffling and reduce overhead. Tasks that can be executed together are grouped into stages, and stages are executed one after another with minimal data movement between them, reducing communication overhead and improving efficiency.
Partitioning and Data Locality: Spark optimizes data partitioning and scheduling to ensure that data processing tasks are executed on nodes where the data resides (data locality). This minimizes data movement across the network and maximizes the utilization of available resources, improving computational efficiency.
Distributed Data Structures and Operations: Spark provides distributed data structures such as RDDs (Resilient Distributed Datasets) and DataFrames, along with a rich set of parallelized operations (transformations and actions) to process data efficiently in parallel across the cluster.
Optimized Shuffle Operations: Spark optimizes shuffle operations, such as groupBy, join, and sortBy, to minimize data shuffling and reduce network traffic. Spark performs partition-aware shuffle, where data is shuffled only within partitions, reducing the amount of data transferred across the network.
By leveraging these techniques, Spark achieves computational effectiveness, enabling scalable, high-performance data processing for a wide range of applications.